Red Wine Quality analysis by Harish Garg

Univariate Plots Section

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
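
The `stat_bin()` message appears when a histogram is drawn without an explicit bin setting. A minimal sketch of passing `binwidth` to `geom_histogram()` follows. It assumes ggplot2 is installed, uses only the first ten alcohol values from the `str()` output as a toy sample, and the bin width of 0.1 is an illustrative choice, not a value from the original analysis.

```r
library(ggplot2)

# Toy sample: the first ten alcohol values shown in the str() output.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# binwidth is expressed in the units of the x variable.
# 0.1 is an assumed, illustrative value.
p <- ggplot(toy, aes(x = alcohol)) +
  geom_histogram(binwidth = 0.1, colour = "white") +
  labs(x = "alcohol", y = "count")

print(p)
```

Setting `binwidth` per variable suppresses the message and makes the bins comparable across reruns.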


##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Univariate Analysis

What is the structure of your dataset?

This dataset has 1599 observations and 13 variables. These 1599 observations correspond to 1599 types of red wines.

What is/are the main feature(s) of interest in your dataset?

  • “quality” is the dependent variable.
  • The remaining 12 variables are treated as independent variables (X is only a row index, which leaves 11 measured properties). We will examine how these variables relate to the dependent variable, i.e. quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Let’s begin by finding the correlation between each independent variable and the dependent variable.

abs(round(cor(wines),3))[-12,"quality"]
##                    X        fixed.acidity     volatile.acidity 
##                0.066                0.124                0.391 
##          citric.acid       residual.sugar            chlorides 
##                0.226                0.014                0.129 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##                0.051                0.185                0.175 
##                   pH            sulphates              quality 
##                0.058                0.251                1.000

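The results suggest that none of the independent variables listed has a strong correlation with quality; the largest absolute value is 0.391, for volatile.acidity. Note that the `-12` index drops row 12 (alcohol) rather than row 13 (quality), so alcohol is missing from this listing and the 1.000 entry is only the correlation of quality with itself. We would therefore need to work with multiple independent variables to see if we get a stronger relationship with quality.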
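
A sketch of selecting the target by name instead of by position, so that only the target itself is dropped. The helper `cor_with_target` is a hypothetical name introduced here, and the data frame is a toy sample built from the first ten values shown in the `str()` output, so the resulting numbers are not meaningful for the full dataset.

```r
# Hypothetical helper: absolute correlation of every numeric column
# with a named target column, excluding the target itself.
cor_with_target <- function(df, target) {
  num <- df[sapply(df, is.numeric)]      # keep numeric columns only
  cors <- cor(num)[, target]             # column of the correlation matrix
  cors <- cors[names(cors) != target]    # drop the target by name
  sort(abs(round(cors, 3)), decreasing = TRUE)
}

# Toy sample: first ten values of four columns from the str() output.
toy <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

res <- cor_with_target(toy, "quality")
print(res)
```

Selecting by name keeps the result correct even if columns are added or reordered.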

Did you create any new variables from existing variables in the dataset?

Not yet. I will update this section if I create any new variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

We are going to turn the quality variable into a factor, as this will let us treat the task as a classification problem.

wines$quality = as.factor(wines$quality)

summary(wines$quality)
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
table(wines$quality)
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
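
Once quality is a factor, it can no longer be passed directly to functions such as `cor()`. The sketch below shows how a factor stores level codes and how the original numbers can be recovered. It uses the first ten quality values from the `str()` output as a toy vector.

```r
# Toy vector: first ten quality values from the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

quality_f <- as.factor(quality)

# The levels are the distinct values, stored as character labels.
print(levels(quality_f))

# as.integer() on a factor returns the internal level codes (1, 2, 3, ...),
# not the original quality scores.
codes <- as.integer(quality_f)

# Converting through character recovers the original scores.
restored <- as.integer(as.character(quality_f))
print(restored)
```

Keeping a numeric copy of the column before the conversion is another option if both forms are needed.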

Bivariate Plots Section

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Variables that don’t change much with quality:

  • fixed.acidity
  • residual.sugar
  • chlorides

Variables that decrease as the quality gets higher:

  • volatile.acidity
  • density
  • pH

Variables that increase as the quality gets higher:

  • citric.acid
  • sulphates
  • alcohol
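
Trends like these can be checked numerically by computing the median of each variable within each quality level. The sketch below uses base R `aggregate()` on a toy sample (the first ten rows of three columns from the `str()` output), so the medians it prints illustrate the mechanics only and do not reproduce the full-data trends.

```r
# Toy sample: first ten values of each column from the str() output.
toy <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# One row per quality level, one median column per variable.
medians <- aggregate(cbind(alcohol, volatile.acidity) ~ quality,
                     data = toy, FUN = median)
print(medians)
```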

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What was the strongest relationship you found?

abs(round(cor(wines[,-c(12,13)]),3))
##                          X fixed.acidity volatile.acidity citric.acid
## X                    1.000         0.268            0.009       0.154
## fixed.acidity        0.268         1.000            0.256       0.672
## volatile.acidity     0.009         0.256            1.000       0.552
## citric.acid          0.154         0.672            0.552       1.000
## residual.sugar       0.031         0.115            0.002       0.144
## chlorides            0.120         0.094            0.061       0.204
## free.sulfur.dioxide  0.090         0.154            0.011       0.061
## total.sulfur.dioxide 0.118         0.113            0.076       0.036
## density              0.368         0.668            0.022       0.365
## pH                   0.136         0.683            0.235       0.542
## sulphates            0.125         0.183            0.261       0.313
##                      residual.sugar chlorides free.sulfur.dioxide
## X                             0.031     0.120               0.090
## fixed.acidity                 0.115     0.094               0.154
## volatile.acidity              0.002     0.061               0.011
## citric.acid                   0.144     0.204               0.061
## residual.sugar                1.000     0.056               0.187
## chlorides                     0.056     1.000               0.006
## free.sulfur.dioxide           0.187     0.006               1.000
## total.sulfur.dioxide          0.203     0.047               0.668
## density                       0.355     0.201               0.022
## pH                            0.086     0.265               0.070
## sulphates                     0.006     0.371               0.052
##                      total.sulfur.dioxide density    pH sulphates
## X                                   0.118   0.368 0.136     0.125
## fixed.acidity                       0.113   0.668 0.683     0.183
## volatile.acidity                    0.076   0.022 0.235     0.261
## citric.acid                         0.036   0.365 0.542     0.313
## residual.sugar                      0.203   0.355 0.086     0.006
## chlorides                           0.047   0.201 0.265     0.371
## free.sulfur.dioxide                 0.668   0.022 0.070     0.052
## total.sulfur.dioxide                1.000   0.071 0.066     0.043
## density                             0.071   1.000 0.342     0.149
## pH                                  0.066   0.342 1.000     0.197
## sulphates                           0.043   0.149 0.197     1.000

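Among the variables included here (the `-c(12,13)` index excludes alcohol and quality), the strongest relationship is between pH and fixed.acidity (0.683). Because the table shows absolute values, the direction of each relationship is not visible in it.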
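
Rather than scanning the matrix by eye, the strongest off-diagonal pair can be located programmatically while keeping the sign. The helper `strongest_pair` is a hypothetical name introduced for this sketch, and the data frame is a toy sample of the first ten rows from the `str()` output, so the pair it reports need not match the full-data result.

```r
# Hypothetical helper: find the pair of columns with the largest
# absolute correlation, returning the signed coefficient.
strongest_pair <- function(df) {
  m <- cor(df)
  m[lower.tri(m, diag = TRUE)] <- NA   # keep each pair once, drop the diagonal
  idx <- which(abs(m) == max(abs(m), na.rm = TRUE), arr.ind = TRUE)[1, ]
  list(
    var1 = rownames(m)[idx[["row"]]],
    var2 = colnames(m)[idx[["col"]]],
    r    = m[idx[["row"]], idx[["col"]]]
  )
}

# Toy sample: first ten values of four columns from the str() output.
toy <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

best <- strongest_pair(toy)
print(best)
```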

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The following pairs of variables, taken together, seem to show interesting relationships for distinguishing between higher-quality and lower-quality wines:

  • alcohol & chlorides
  • alcohol & volatile.acidity
  • alcohol & sulphates
  • sulphates & volatile.acidity
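
One possible way to look at such a pair is a scatter plot of the two variables with points coloured by quality. This is a sketch under assumptions, not necessarily the plot used in the original analysis: it assumes ggplot2 is installed and uses a toy sample of the first ten rows from the `str()` output.

```r
library(ggplot2)

# Toy sample: first ten values of each column from the str() output.
toy <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# Two continuous variables on the axes, quality as a discrete colour.
p <- ggplot(toy, aes(x = alcohol, y = volatile.acidity, colour = quality)) +
  geom_point(size = 3) +
  labs(x = "alcohol", y = "volatile.acidity", colour = "quality")

print(p)
```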

Were there any interesting or surprising interactions between features?

No, I didn’t see any worth mentioning.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

The large majority of wines (1319 of 1599, more than 4/5ths) have a quality of 5 or 6.
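
The share can be worked out directly from the quality counts in the `table()` output:

```r
# Counts per quality level, copied from the table(wines$quality) output.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total <- sum(quality_counts)

# Percentage of wines at each quality level.
share_pct <- round(100 * quality_counts / total, 1)
print(share_pct)

# Combined share of quality 5 and 6.
mid_share <- sum(quality_counts[c("5", "6")]) / total
print(mid_share > 4 / 5)
```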

Plot Two

Description Two

As the wine quality increases, the median values of sulphates, alcohol and citric.acid increase, while the median values of volatile.acidity, density and pH decrease.

Plot Three

Description Three

Two-variable combinations give us a little insight into the difference between wines of higher and lower quality.


Reflection

We started by trying to find the variables that, individually or in combination, influence the quality of the wine. We conclude that no single variable can predict the quality of the wine on its own. We would need to use multiple variables and do more analysis, probably with more data.